ABSTRACT

We propose a scheme for multi-layer representation of images. The problem is first treated from an information-theoretic viewpoint, where we analyze the behavior of different sources of information under a multi-layer data compression framework and compare it with a single-stage (shallow) structure. We then consider image data as the source of information and link the proposed representation scheme to the problem of multi-layer dictionary learning for visual data. In the current work we focus on the problem of image compression for a special class of images, where we report a considerable performance boost in terms of PSNR at high compression ratios in comparison with the JPEG2000 codec.

Index Terms — visual data representation, rate-distortion 
theory, lossy compression, image compression, dictionary 
learning 

1. INTRODUCTION 

Sparse data approximation is a fundamental problem in many areas of signal processing and machine learning. For different tasks like multimedia compression, content identification, multi-class classification and representation learning, one aims at straightforward, concise and computationally feasible approximations.

The above requirements, while conflicting in nature, have been formulated and extensively studied under different concepts and applications, such as rate-distortion theory, approximate nearest neighbor search, vector quantization, dictionary learning and supervised/unsupervised learning, in different disciplines.

In this work, we try to address some of the issues considered in these topics by asking the question: which data representation scheme is the most concise in terms of memory storage, the fastest in terms of computational complexity and the most accurate in terms of fidelity to the original data?

To this end, we propose a framework that could potentially be used in many different applications such as quality enhancement, denoising, inpainting, visual recognition and joint compression-encryption. The general idea behind this approach is present in several earlier works [?], [?]; we unify them and treat the problem from a practically significant perspective along with an information-theoretic analysis.

In particular, for this work, we consider the problem of image compression. We show that the proposed framework, when adapted to a particular class of images (face images in our application), can achieve a considerable gain in compression performance over the JPEG2000 codec in the very low bit-rate regime on images belonging to the same class.

Such a problem formulation is of great practical significance for applications where a significant number of images of a similar nature, such as facial/iris images in biometrics, medical images, remote sensing images and astronomical images, are to be compressed and communicated. In this case, the use of a generic codec whose basis vectors are not adapted to the statistics of the images is known to be inefficient, especially in the low-rate regime. Moreover, the overhead of storing a common trained codebook may be minor in comparison with the gain over millions of images.

The paper is organized as follows. In Section 2 we briefly review the classic Shannon rate-distortion theory, where the data are represented in one single layer. In Section 3 we discuss the information-theoretic analysis of the multi-layer structure. Section 4 studies the behavior of i.i.d. sources of information under the multi-layer structure. Section 5 considers images as the data to be treated within this framework, where a short review of facial image compression in the literature is also provided. The experimental results for image compression are discussed in Section 6. Finally, we conclude the paper in Section 7.



2. SHANNON RATE-DISTORTION THEORY: 
SHALLOW REPRESENTATION 

Fig. 1: Encoding and decoding in Shannon's shallow structure.

The trade-off between the concise representation of a source of information and the fidelity is theoretically treated and formulated by Shannon in [?]. In this analysis, for the joint description of the outcomes of the sequence of random variables $X^n = \{X_1, \cdots, X_n\}$, the measure of compactness is the compression rate, defined as $R_c = \frac{1}{n}\log_2 k$ and measured in bits, when we store $k$ codewords in a codebook $\mathcal{C}$, each of which refers to a data point in $\mathbb{R}^n$.

The $k$ codewords $\hat{x}^n$ are generated from a distribution $p(\hat{x}^n | x^n)$ and organized into a shallow codebook $\mathcal{C}$ as shown in Fig. 1. Each codeword has an assigned index $1 \le w \le k$. This codebook $\mathcal{C} = \{\hat{x}^n(1), \cdots, \hat{x}^n(k)\}$ is shared between the encoder $E$ and the decoder $D$. The sequence $\hat{x}^n$ with index $w$ represents a compressed counterpart of $x^n$. It should be pointed out that the codebook $\mathcal{C}$ is overcomplete, since $k = 2^{nR_c}$, i.e., the number of codewords $k$ is exponential in $n$. Moreover, the representation is sparse, since only one codeword $\hat{x}^n(w)$ is used for the approximation of $x^n$.

This representation leads to a loss of quality that should be measured as a distortion between $x^n$ and $\hat{x}^n$. One widely used measure of distortion between $x^n$ and $\hat{x}^n$ is the MSE, defined as $d(x^n, \hat{x}^n) = \sum_{i=1}^{n} (x_i - \hat{x}_i)^2$.

The Shannon theory relates these two concepts by defining the rate-distortion function and relating it to the mutual information between the sequence and its representation, hence paving the way for the calculation of this function for various sources.

More concretely, rate-distortion theory states that, in order to guarantee that the expected distortion between $x^n$ and $\hat{x}^n$ does not exceed a threshold distortion value $D$, i.e., $E[d(X^n, \hat{X}^n)] \le nD$, the compression rate should be lower-bounded by the rate-distortion function $R_c(D)$. This lower bound is proven to be equal to:

$R_c(D) = \min_{p(\hat{x}|x):\, E[d(X,\hat{X})] \le D} I(X; \hat{X})$. (1)

An important consequence of this theory is that, for memoryless sources of information emitting i.i.d. sequences, the distortion-rate function (an alternative to the rate-distortion function) is upper-bounded by that of the Gaussian source with the same variance $\sigma^2$ under the MSE distortion measure, as:

$D(R_c) \le \sigma^2 2^{-2R_c}$. (2)

These bounds, suggested by Shannon's rate-distortion theory, are however proven to be achieved only in the asymptotic case where the block length $n \to \infty$. Note that, for a fixed rate, any increase in the block length $n$ leads to an exponential increase in the number of representations, as $k = 2^{nR_c}$. This means that, in a data representation setting where several data points are to be stored in memory and exhaustively matched when queries are presented, one has to deal with exponential complexity for both search and memory storage. Therefore, the current setup, while conceptually very important, is not appealing for many practical scenarios.
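To make these quantities concrete, the following minimal sketch evaluates the Gaussian bound of equation (2) and the shallow codebook size $k = 2^{nR_c}$ for a few block lengths. The function names and the numerical values are illustrative assumptions and do not come from the paper.

```python
import math


def gaussian_distortion_bound(sigma2: float, rate: float) -> float:
    """Right-hand side of equation (2): sigma^2 * 2^(-2 * rate)."""
    return sigma2 * 2.0 ** (-2.0 * rate)


def shallow_codebook_size(n: int, rate: float) -> int:
    """Number of codewords k = 2^(n * rate), rounded up to an integer."""
    return math.ceil(2.0 ** (n * rate))


if __name__ == "__main__":
    rate = 1.0  # bits per sample (illustrative value)
    print("distortion bound:", gaussian_distortion_bound(1.0, rate))
    for n in (8, 16, 32, 64):  # illustrative block lengths
        # The codebook size grows exponentially with n at a fixed rate.
        print(n, shallow_codebook_size(n, rate))
```

Even at one bit per sample, a block length of 64 already calls for roughly $1.8 \times 10^{19}$ codewords, which is the storage and search burden the shallow structure cannot avoid.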

3. INFORMATION-THEORETIC ANALYSIS OF 
MULTI-LAYER REPRESENTATION 

Instead of the above single-layer (shallow) representation of information, where we have a shallow codebook $\mathcal{C}$ with $k \ge 2^{nR_c}$ and $\hat{x}^n(w)$ is the $w$-th representation vector in $\mathcal{C}$, consider the case where we have multiple codebooks $\mathcal{C}_1, \cdots, \mathcal{C}_L$, where the $i$-th codebook consists of $\mathcal{C}_i = \{\hat{x}_i^n(1), \cdots, \hat{x}_i^n(k_i)\}$. The number of codewords $k_i$, or the corresponding rates, will be specified later.

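Consider the case where the final encoding or source approximation is done as $\hat{X}^n = \phi(\hat{X}_1^n, \cdots, \hat{X}_L^n)$. The rate-distortion function can then be calculated from equation (1). The mutual information in this case can be bounded as

$$
\begin{aligned}
I(X^n; \hat{X}^n) &= I\big(X^n; \phi(\hat{X}_1^n, \cdots, \hat{X}_L^n)\big) \\
&= \underbrace{I(X^n; \hat{X}_1^n)}_{nR_1} + \cdots + \underbrace{I(X^n; \hat{X}_L^n \mid \hat{X}_1^n + \cdots + \hat{X}_{L-1}^n)}_{nR_L} \\
&= \sum_{i=1}^{L} I(X^n; \hat{X}_i^n \mid \hat{X}_1^n + \cdots + \hat{X}_{i-1}^n) \\
&= n(R_1 + \cdots + R_L).
\end{aligned}
\tag{3}
$$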


The important consequence of the developments in equation (3) is that, to achieve a high rate $R_c$, which requires exponential storage and computational complexity in the shallow representation (due to $k \ge 2^{nR_c}$), one can achieve a targeted $R_c$ with $L$ codebooks, each with a very low rate, such that

$2^{nR_c} = 2^{nR_1} \times \cdots \times 2^{nR_L}$,

or equivalently,

$R_c = \sum_{i=1}^{L} R_i$.

Therefore, the exponential nature of the required shallow codebook size for high rates is achieved by multiplication of smaller codebook sizes, i.e., the equivalent alphabet size will be:

$k_{eq} = \prod_{i=1}^{L} k_i$. (5)

Fig. 2: The encoding structure in the multi-layer representation. $E_i$, in the general formulation, has the knowledge of

Fig. 2 sketches the structure of codebooks and decoding in the multi-layer representation.

3.1. Multi-Layer Additive Structure 

Consider the special case where the reconstruction function $\phi(\cdot)$ is additive, i.e., $\hat{X}^n = \hat{X}_1^n + \cdots + \hat{X}_L^n$. Given a realization of the source, the decoder in this case consists of finding the Euclidean nearest neighbor of the sequence among the codewords of the first codebook, calculating the estimation error, passing it to the next stage and repeating the same procedure until the last stage, where the overall error is equal to the error of the last stage, since $\hat{X}_1^n = Q_1(X^n)$, $\hat{X}_2^n = Q_2(X^n - \hat{X}_1^n)$, $\cdots$, $\hat{X}_L^n = Q_L(X^n - (\hat{X}_1^n + \cdots + \hat{X}_{L-1}^n))$, where $Q_i(\cdot)$ denotes the Vector Quantizer for the $i$-th stage with distortion $D_i$.
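The stage-by-stage procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, each quantizer is an exhaustive nearest-neighbor search, and the toy codebooks are hand-made values in two dimensions.

```python
from typing import List, Sequence

Vector = List[float]


def nearest_index(codebook: Sequence[Vector], x: Vector) -> int:
    """Index of the codeword closest to x in squared Euclidean distance."""
    def sq_dist(c: Vector) -> float:
        return sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return min(range(len(codebook)), key=lambda w: sq_dist(codebook[w]))


def multistage_quantize(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize x stage by stage; each stage quantizes the previous residual."""
    residual = list(x)
    indices = []
    for codebook in codebooks:
        w = nearest_index(codebook, residual)
        indices.append(w)
        residual = [r - c for r, c in zip(residual, codebook[w])]
    return indices


def additive_reconstruct(indices: Sequence[int],
                         codebooks: Sequence[Sequence[Vector]]) -> Vector:
    """Additive reconstruction: sum the selected codewords, one per stage."""
    x_hat = [0.0] * len(codebooks[0][0])
    for w, codebook in zip(indices, codebooks):
        x_hat = [a + c for a, c in zip(x_hat, codebook[w])]
    return x_hat


# Toy example with two hand-made stages in R^2 (hypothetical values).
toy_codebooks = [
    [[0.0, 0.0], [4.0, 4.0]],
    [[0.0, 0.0], [1.0, -1.0]],
]
toy_indices = multistage_quantize([5.0, 3.0], toy_codebooks)
toy_reconstruction = additive_reconstruct(toy_indices, toy_codebooks)
```

In the toy example the first stage selects the codeword (4, 4), the residual (1, -1) is matched exactly by the second stage, and the sum of the two codewords recovers the input.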

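Moreover, for the Gaussian source $X_i \sim \mathcal{N}(0, \sigma^2)$ we have that $R_1 = \frac{1}{2}\log_2\left(\frac{\sigma^2}{D_1}\right)$, $R_2 = \frac{1}{2}\log_2\left(\frac{D_1}{D_2}\right)$, $\cdots$, $R_L = \frac{1}{2}\log_2\left(\frac{D_{L-1}}{D_L}\right)$, where $D_i$ stands for the corresponding distortion, and $D_L = D$.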
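As a short check of these per-layer rates, the following worked step shows that they telescope to the single-layer Gaussian rate-distortion function. It assumes ideal successive refinement of the Gaussian source, with $D_0 = \sigma^2$ and each layer operating at $R_i = \frac{1}{2}\log_2(D_{i-1}/D_i)$; the finite block length and the additive structure discussed in this section are not accounted for.

```latex
\sum_{i=1}^{L} R_i
  = \frac{1}{2}\sum_{i=1}^{L} \log_2\frac{D_{i-1}}{D_i}
  = \frac{1}{2}\log_2\frac{D_0}{D_L}
  = \frac{1}{2}\log_2\frac{\sigma^2}{D}.
```

Under these idealized assumptions, splitting the total rate over $L$ layers costs nothing in rate for the same final distortion $D$.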

The decoding as well as memory complexities will be $O\left(\sum_{i=1}^{L} 2^{nR_i}\right)$ instead of $O(2^{nR_c})$ in the shallow structure.

Reducing the multi-stage encoding function $\phi(\cdot)$ to the addition operator, while being simple and intuitive, reduces the optimality, since in general

$I(X; \hat{X}_1 + \cdots + \hat{X}_L) \le I(X; \hat{X}_1, \cdots, \hat{X}_L)$

and $I(X; \hat{X}_1 + \cdots + \hat{X}_L)$ cannot be decomposed directly into conditional terms as in equation (3).

This issue, while reducing optimality due to some information loss in the addition operations, keeps the great advantage of breaking the exponential complexity of one huge shallow structure into several codebooks of considerably smaller sizes. In addition, learning an exponentially large codebook in the shallow structure is practically infeasible, since it also requires an exponential number of training samples. In contrast, the multi-layer structure can be easily trained for low rates $R_i$.

In Section 4 we simulate the performance of this scheme for i.i.d. sources of information and show that this loss is not a limiting factor.


4. MULTI-LAYER REPRESENTATION OF I.I.D.
SOURCES FOR SYNTHETIC DATA


Consider the stationary ergodic source $X^n$ with $X_i \sim p_X(x)$. The realizations of this source are to be represented with codebooks $\mathcal{C}_1, \cdots, \mathcal{C}_L$, each with $k_i$ codewords. The decoding is done as described in Section 3.1.

For the encoding at each stage, using the stationarity and ergodicity assumptions on the source, and therefore the specific geometry imposed on the data distribution in $\mathbb{R}^n$ as $n$ grows large, we design the codewords in the different codebooks very efficiently using only random codewords that are properly normalized.

Suppose that $X_i \sim \mathcal{N}(0, \sigma^2)$ for all $1 \le i \le n$, which means that the data are i.i.d. In this case, the data concentrate around a spherical shell with radius $\rho = \sqrt{n\sigma^2}$ as $n$ grows large.
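The concentration of i.i.d. Gaussian vectors around this shell can be checked empirically. The following sketch is only an illustration; the function name, sample counts and seed are arbitrary choices. It reports the ratio between the norm of each sample and the radius $\sqrt{n\sigma^2}$.

```python
import math
import random


def norm_to_radius_ratios(n: int, sigma: float, num_samples: int, seed: int = 0):
    """Ratios ||x|| / sqrt(n * sigma^2) for i.i.d. Gaussian vectors x in R^n."""
    rng = random.Random(seed)
    radius = math.sqrt(n * sigma ** 2)
    ratios = []
    for _ in range(num_samples):
        x = [rng.gauss(0.0, sigma) for _ in range(n)]
        ratios.append(math.sqrt(sum(v * v for v in x)) / radius)
    return ratios


if __name__ == "__main__":
    for n in (10, 100, 1000):
        r = norm_to_radius_ratios(n, sigma=1.0, num_samples=200)
        # The spread of the ratios around 1 shrinks as n grows.
        print(n, round(min(r), 3), round(max(r), 3))
```

As $n$ increases, the ratios cluster more tightly around 1, which is the geometric property that makes properly normalized random codewords a reasonable design.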

For the first stage of encoding, suppose we want to compress the data to the rate $R_1$. The achievable distortion for this stage, for $n$ large enough, is given by equation (2) as $D_1 = \sigma^2 2^{-2R_1}$, which is achieved for optimal codebook design, in this case with random structure.

Due to the optimality proved for this hypothetical case in terms of MSE distortion, one can conclude that the estimate is orthogonal to its error (by the principle of orthogonality), i.e., $E[(\hat{X}_1^n)^T (X^n - \hat{X}_1^n)] = 0$.

Therefore, from the law of cosines, one can confirm that the variance of the codewords of the first codebook is:

$\sigma_{\hat{X}_1}^2 = \sigma^2 - D_1$. (6)

Extending the argument to other layers, it can be concluded that the variance of the codewords of the $i$-th layer is given as:

$\sigma_{\hat{X}_i}^2 = D_{i-1} - D_i$, with $D_0 = \sigma^2$. (7)

4.1. Design of Random Codebooks for Multi-Layer Representation

The use of random codebooks is appealing both in theory and for practical applications. Avoiding overfitting to the seen data in a machine learning setup, preserving privacy and security in multimedia or medical data management and eliminating the computational cost of codebook design are among the advantages of this approach.

To this end, equations (6) and (7) should be considered in random codebook design. Fig. 3 shows the effect of normalization of codewords in a codebook on the achieved distortion and also on orthogonality, as measured by $E[\hat{X}^T (X - \hat{X})]$, for different codebook variances.



(a) distortion vs. variance (b) orthogonality vs. variance 

Fig. 3: The effect of codebook normalization on the achieved distortion and orthogonality for a zero-mean Gaussian source in $\mathbb{R}^n$ with $n = 200$ and $\sigma^2 = 1$ and a randomly generated codebook with k = S with varying variance.

As is shown in the figure, the empirical optimum for $n = 200$ is not far from the theoretical optimum when $n \to \infty$, the difference being due to geometrical variations of the $n$-sphere for different values of $n$.

In fact, by proper normalization, as dictated by these equations, we show that we can get very close to the theoretical distortion-rate limit of equation (2) for moderate values of $n$.

Fig. 4 shows the achieved distortion for an i.i.d. source $X^n$ with $n = 512$ synthesized from $X \sim \mathcal{N}(0, 1)$ for different compression rates. We consider the compression using two different sets of codebooks. The first is a randomly generated i.i.d. Gaussian codebook with $\hat{X}_i \sim \mathcal{N}(0, D_{i-1} - D_i)$ for the $i$-th layer, with $D_0 = \sigma^2$, where $D_{i-1}$ is the average distortion of the $(i-1)$-th layer.

The second set of codebooks are equiprobable binary codebooks with alphabet $\mathcal{X}_i = \{\pm a_i\}$, with $a_i$ chosen to guarantee the same variance as the Gaussian case in each layer.
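A small-scale version of this experiment can be sketched as below. It is a toy illustration under stated assumptions, not a reproduction of the reported results: the dimensions ($n = 32$, $k = 16$, 8 layers, 40 samples) are chosen only to keep a pure-Python run fast, the function names are hypothetical, and the target distortion of each layer is taken from equation (2) applied to the measured distortion of the previous layer.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def simulate_layers(kind, n=32, k=16, num_layers=8, num_samples=40, seed=0):
    """Return average per-sample distortions [D_0, D_1, ..., D_L].

    kind: "gaussian" or "binary" random codebooks.
    Layer i uses codeword variance D_{i-1} - D_i, where D_{i-1} is the
    measured distortion of the previous layer and D_i = D_{i-1} * 2^(-2 R_i)
    is the target from equation (2), with R_i = log2(k) / n.
    """
    rng = random.Random(seed)
    rate = math.log2(k) / n
    residuals = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(num_samples)]
    zero = [0.0] * n
    distortions = [sum(sq_dist(r, zero) for r in residuals) / (n * num_samples)]
    for _ in range(num_layers):
        d_prev = distortions[-1]
        scale = math.sqrt(d_prev - d_prev * 2.0 ** (-2.0 * rate))
        if kind == "gaussian":
            codebook = [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(k)]
        else:  # equiprobable binary alphabet {+a_i, -a_i} with the same variance
            codebook = [[rng.choice((-scale, scale)) for _ in range(n)] for _ in range(k)]
        new_residuals = []
        for r in residuals:
            best = min(codebook, key=lambda c: sq_dist(r, c))
            new_residuals.append([x - y for x, y in zip(r, best)])
        residuals = new_residuals
        distortions.append(sum(sq_dist(r, zero) for r in residuals) / (n * num_samples))
    return distortions


gaussian_curve = simulate_layers("gaussian")
binary_curve = simulate_layers("binary")
```

At such a short block length the curves stay well above the bound of equation (2), so this sketch only illustrates the mechanics of the normalization and the similar behavior of the two codebook types.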

As is seen from the figure, the achieved distortion-rate function, without the exponential complexity burdens of the shallow structure, closely approximates the behavior of the Shannon lower bound in equation (2). The difference from the theoretical limit is due both to the finite block length and to the information loss caused by the additive encoding, as explained in Section 3.1.

Fig. 4: Distortion-rate behavior of the multi-stage additive structure for an i.i.d. Gaussian source with Gaussian and binary codebooks. The data dimension was $n = 512$ and $k_i = 2^{12}$ for all values of $1 \le i \le L = 200$. Therefore, the rate at each stage was $R_i \approx 0.023$ with an overall rate of $R_c = 4.69$.

Interestingly, the behavior of the two codebook design strategies is the same. The reason is the choice of very small rates at each stage. In fact, in an analogy with channel coding, the dual problem of rate-distortion theory, one can verify that the capacities of the Gaussian channel and the binary symmetric channel are very close at extremely small rates.

This fact is of great practical significance since, if the rate selection and normalization are done properly, one does not have to worry about matching the distribution of the codebooks with that of the source. Moreover, the memory storage of real-valued codewords can be reduced to that of binary values.

5. FACIAL IMAGE COMPRESSION 

Due to its practical significance, image compression has become a very mature field of both research and technology. Among the existing methods of image compression, JPEG2000 is reported to be among the best algorithms used in practice [?], with a very intricate structure to achieve a highly optimized trade-off between compression ratio and performance. However, since it is a general-purpose codec, for applications where a large number of similar images must be compressed, one could think of compression methods that exploit the extra redundancy present due to the similarities among the application images. Moreover, the JPEG2000 codec is not capable of providing very high compression ratios, while many applications would require images to be highly compressed and would compromise quality for description efficiency.














One important example of this scenario is the compression of facial images. They are available in large quantities in big databases of police departments, organizations and entities with many employees and users. Efficient compression of these images in terms of storage and computational complexity is very important, since it immediately frees up resources and thus allows services to be provided to more users. Moreover, in some applications, the recognition informativeness of facial images is more important than quality and fine details.

Apart from the extensive literature on image compression, there have been several works on the compression of facial images. In [?], a facial compression scheme based on Vector Quantization was proposed, where a considerable performance improvement over JPEG2000 was reported at very low rates. However, this method needs detection of facial features (sometimes manually), alignment by geometrical transformation into a canonical form and also background removal, which makes it very sensitive to the required pre-processing. Within the same setup, an approach based on dictionary learning with the K-SVD algorithm was proposed in [?], where a special dictionary was learned for every block location of the image. In another work [?], facial image compression using the Redundant Tree-Based Wavelet Transform (RTBWT) was performed with the same pre-processing and a filtering-based post-processing to improve the quality of images. In spite of their high performance in terms of PSNR, the problem with these approaches is that they rely heavily on the alignment of images and are less likely to generalize once the imaging setup is changed even slightly.

Another scheme was proposed in [?], where the authors build a codec using the Iteration Tuned and Aligned Dictionary (ITAD) to compress facial images, with dictionaries tuned in every iteration of the pursuit algorithm used. A considerable compression performance gain is reported for a wider range of compression rates. However, the tree structure of the dictionaries requires considerable storage.

5.1. Multi-layer approximation of Images 

We apply the above framework to compress facial images. Images from the training set are divided into non-overlapping blocks and then gathered in a database. Without any special pre-processing, the blocks are vectorized and fed to the simple k-means algorithm. The residual of quantization is fed to the next stages for further quantization. To avoid over-fitting, $k_i$, the number of cluster centroids (codewords) at the $i$-th layer, is chosen such that the distortion of reconstruction of the test data is within a margin of the distortion of reconstruction of the training data.
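The training procedure described above (k-means on the blocks, then k-means on the residuals of each previous layer) can be sketched as follows. This is a minimal pure-Python illustration with hypothetical function names and a plain Lloyd's k-means; it does not implement the selection of $k_i$ based on the margin between training and test distortion, and the layer sizes are passed in as given.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns k centroids (lists of floats)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def train_residual_codebooks(blocks, layer_sizes, seed=0):
    """Train one k-means codebook per layer on the residuals of the previous layer.

    Returns (codebooks, distortions), where distortions[i] is the mean squared
    error per block after i layers (distortions[0] is the energy of the input).
    """
    residuals = [list(b) for b in blocks]
    zero = [0.0] * len(blocks[0])
    distortions = [sum(sq_dist(r, zero) for r in residuals) / len(residuals)]
    codebooks = []
    for i, k in enumerate(layer_sizes):
        codebook = kmeans(residuals, k, seed=seed + i)
        codebooks.append(codebook)
        new_residuals = []
        for r in residuals:
            best = min(codebook, key=lambda c: sq_dist(r, c))
            new_residuals.append([x - y for x, y in zip(r, best)])
        residuals = new_residuals
        distortions.append(sum(sq_dist(r, zero) for r in residuals) / len(residuals))
    return codebooks, distortions
```

Here `blocks` stands for the vectorized image blocks. On the training data the distortion cannot increase from one layer to the next, which is why a separate check on held-out data is needed to detect over-fitting.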

The encoding part consists of assigning to each image block a sequence of $L$ indices, the $i$-th taking values from an alphabet of $k_i$ codewords. Therefore, the Bits Per Pixel (BPP) value for the image will be $\sum_{i=1}^{L} \log_2(k_i) / b^2$, where $b$ is the block size. This value could be reduced by applying entropy coding to the indices, where a probability table could be trained from the training set for each of the stages.
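As a worked instance of this formula, the sketch below computes the fixed-length BPP before entropy coding. The layer sizes are the ones listed in Section 6, whereas the block side $b = 16$ is a hypothetical value chosen only for illustration, since the block size is not stated here; the function name is likewise an assumption.

```python
import math


def bits_per_pixel(layer_sizes, block_side):
    """Fixed-length BPP: sum_i log2(k_i) / b^2, before any entropy coding."""
    bits_per_block = sum(math.log2(k) for k in layer_sizes)
    return bits_per_block / (block_side ** 2)


# Layer sizes as listed in Section 6; the block side b = 16 is hypothetical.
layer_sizes = [256] * 5 + [128] * 5 + [32] * 5 + [16] * 5
example_bpp = bits_per_pixel(layer_sizes, block_side=16)
```

With all 20 layers, each block costs 120 bits, so this hypothetical block size gives about 0.47 BPP; using only the first few layers, or a larger block, lowers the rate accordingly.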

The decoding part simply consists of table look-ups to read the values of the corresponding entropy-decoded sequences of codewords for each block, followed by their addition. This process could be done online and sequentially once the required bits for each stage are received.

6. EXPERIMENTAL RESULTS FOR IMAGE 
COMPRESSION 

We used 2400 randomly chosen images (80% for training and 20% for testing) from the CroppedYaleB [?] database of cropped facial images with different lighting conditions. This is a difficult database for compression, since the variation of lighting across images is very significant and shadows can obscure different parts of the face in different images. Therefore, one cannot train highly specialized dictionaries for different locations. Moreover, unlike the databases used in the existing approaches, the background is completely removed from the faces and the algorithm cannot benefit from the redundant areas common to all images.

We used $L = 20$ layers of global codebooks with $k = 256$, 128, 32 and 16 codewords at the first, second, third and fourth consecutive five layers, respectively. As previously mentioned, the choice of these values should avoid over-fitting. As can be understood from the values chosen for $k_i$, and as also seen from Fig. 7, the later stages have a less correlated structure and tend to over-train more easily.

Fig. 5a compares the compression behavior of this experiment with that of the JPEG2000 codec in terms of PSNR (dB) for different values of Bits Per Pixel (BPP), and Fig. 5b shows the rate-distortion performance. As is seen from this figure, the quality of the compressed facial images at very low rates is significantly superior to that of JPEG2000. We used simple arithmetic coding for the indices.
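For reference, the PSNR figure of merit used in this comparison can be computed as in the sketch below. It assumes 8-bit images with a peak value of 255, which is the common convention but is not stated explicitly here; the function name and the flat-list input format are illustrative choices.

```python
import math


def psnr_db(original, reconstructed, peak=255.0):
    """PSNR in dB between two equally sized sequences of pixel values."""
    if len(original) != len(reconstructed):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

A higher PSNR at the same BPP means a smaller mean squared error between the original and the decoded image.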

7. CONCLUSIONS 

We presented a multi-layer data representation approach and justified its efficiency in terms of data fidelity, memory storage and computational complexity with information-theoretic arguments. We then used this approach for image compression when the images belong to a certain class, in our experiments facial images. We showed that, with its simple structure in the direct pixel domain, a significant performance boost was achieved in the very low rate regime compared to the JPEG2000 codec. The scheme could still be improved in different ways, for example in the choice of codebook sizes, the entropy coding used or post-processing to reduce the blocking artifact.




Fig. 5: Average compression performance over 480 test facial images of the CroppedYale set: (a) comparison with JPEG2000 in terms of PSNR; (b) distortion-rate behavior (train set size was 1920).

Fig. 6: Visual comparison of the proposed compression with the JPEG2000 codec over random images from the test set. Panels: (a) original, (b) proposed (bpp = 0.05), (c) JPEG2000 (bpp = 0.05), (d) original, (e) proposed (bpp = 0.09), (f) JPEG2000 (bpp = 0.09).

Fig. 7: Randomly selected codewords from two different layers.